Session 9: Scraping Interactive Web Pages

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-08-01

Introduction

This Course

Day  Session
1    Introduction
2    Data Structures and Wrangling
3    Working with Files
4    Linking and joining data & SQL
5    Scaling, Reporting and Database Software
6    Introduction to the Web
7    Static Web Pages
8    Application Programming Interfaces (APIs)
9    Interactive Web Pages
10   Building a Reproducible Research Project

The Plan for Today

In this session, we learn how to hunt down wild data. We will:

  • Learn how to find secret APIs
  • Emulate a browser
  • Focus specifically on step 1 below

Original Image Source: prowebscraper.com

Philipp Pilz via unsplash.com

Request & Collect Raw Data: a closer look

Common Problems

Imagine you want to scrape researchgate.net, since it contains self-created profiles of many researchers. However, when you try to get the HTML content:

library(rvest)
read_html("https://www.researchgate.net/profile/Johannes-Gruber-2")
{html_document}
<html lang="en">
[1] <head>\n<meta charset="utf-8">\n<meta http-equiv="content-type" content=" ...
[2] <body class="logged-out">\n<div id="lite-page" class="lite-page lite-page ...

If you don’t know what an HTTP error means, you can go to https://http.cat and have the status explained in a fun way. Below I use a little convenience function:

error_cat <- function(error) {
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)

So what’s going on?

  • If something like this happens, the server essentially did not fulfill our request
  • This is because the website seems to have some special requirements for serving the (correct) content. These could be:
    • specific user agents
    • other specific headers
    • login through browser cookies
  • To find out how the browser manages to get the correct response, we can use the Network tab in the inspection tool
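
To see which status a server returns without R stopping at the first error, something like the following sketch can help. The helper names fetch_response() and describe_status() are made up for this illustration, and the 403 response is a mock built with httr2::response(), so nothing is sent over the network here.

```r
library(httr2)

# Perform a request, but do not turn HTTP errors (4xx/5xx) into R errors,
# so the status code can be inspected afterwards (hypothetical helper).
fetch_response <- function(url) {
  request(url) |>
    req_error(is_error = function(resp) FALSE) |>
    req_perform()
}

# Summarise the status of a response in one line (hypothetical helper)
describe_status <- function(resp) {
  paste(resp_status(resp), resp_status_desc(resp))
}

# Offline illustration: a mock response instead of a live call
mock <- response(status_code = 403)
describe_status(mock)
resp_is_error(mock)
```

With a live site, fetch_response("https://...") |> describe_status() would report whatever the server decides to send back.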

Strategy 1: Emulate what the Browser is Doing

Open the Inspect Window Again:

But this time, we focus on the Network tab:

Here we get an overview of all the network activity of the browser and the individual requests for data that are performed. Clear the network log first and reload the page to see what is going on. Finding the right call is not always easy, but in most cases, we want:

  • a call with status 200 (OK/successful)
  • a document type
  • something that is at least a few kB in size
  • Initiator is usually “other” (we initiated the call by refreshing)

Once you have identified the call, right-click it and choose Copy -> Copy as cURL.

More on cURL Calls

What is cURL:

  • cURL is a command-line tool (and library) that can make HTTP requests.
  • it is widely used for API calls from the terminal.
  • it lists the parameters of a call in a pretty readable manner:
    • the unnamed argument in the beginning is the Uniform Resource Locator (URL) the request goes to
    • -H arguments describe the headers, which are arguments sent with the call
    • -d is the data or body of a request, which is used e.g., for uploading things
    • -o/-O can be used to write the response to a file (otherwise the response is returned to the screen)
    • --compressed means to ask for a compressed response which is unpacked locally (saves bandwidth)
curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H 'cookie: [Redacted]' \
  -H 'sec-ch-ua: "Chromium";v="115", "Not/A)Brand";v="99"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed
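
As a rough orientation, the cURL flags listed above map onto httr2 functions roughly as in the sketch below. The URL https://example.com/upload and the JSON body are placeholders invented for this illustration, and the request is only built, not performed.

```r
library(httr2)

req <- request("https://example.com/upload") |>        # the unnamed URL argument
  req_headers(accept = "application/json") |>          # -H 'accept: application/json'
  req_body_raw('{"a": 1}', type = "application/json")  # -d '{"a": 1}'

# -o/-O would correspond to req_perform(req, path = "response.json"),
# which is deliberately not run here.
req
```

Printing req shows the URL, headers and body that would be sent, which makes it easy to compare against the copied cURL call.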

httr2::curl_translate()

  • We have seen httr2::curl_translate() in action yesterday
  • It can also convert more complicated calls, which lets R look no different from a regular browser to the server
  • Remember: you need to escape every " inside the call. Press Ctrl + F to open the Find & Replace tool, put " in the find field and \" in the replace field, and replace all matches except the first and last, which delimit the R string:
library(httr2)
httr2::curl_translate(
"curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H 'cookie: [Redacted]' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed"
)
request("https://www.researchgate.net/profile/Johannes-Gruber-2") |> 
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |> 
  req_perform()
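
Escaping every double quote by hand is easy to get wrong. One alternative, assuming R 4.0 or newer, is a raw string: everything between r"( and )" is taken literally, so the copied call can be pasted unchanged. The short call to example.com below is a made-up stand-in for a real copied call.

```r
library(httr2)

# Escaped version: every inner double quote needs a backslash
escaped_cmd <- "curl 'https://example.com' -H 'sec-ch-ua: \"Chromium\";v=\"115\"'"

# Raw string version: the call is pasted as copied from the browser
raw_cmd <- r"(curl 'https://example.com' -H 'sec-ch-ua: "Chromium";v="115"')"

# Both spellings produce exactly the same string
identical(escaped_cmd, raw_cmd)

# The raw string can be handed to curl_translate() directly
curl_translate(raw_cmd)
```

If the copied call itself contains the sequence )" the delimiters can be extended with dashes, e.g. r"---(...)---".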

‘Emulating’ the Browser Request

request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
    `sec-fetch-dest` = "document",
    `sec-fetch-mode` = "navigate",
    `sec-fetch-site` = "cross-site",
    `sec-fetch-user` = "?1",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()
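
Once such a request succeeds, the response can be handed to rvest like any other HTML document. The sketch below uses a mock response with an invented h1 element of class "name" instead of the real ResearchGate page, so the selector is an assumption and will differ on the live site.

```r
library(httr2)
library(rvest)

# A mock response standing in for the result of req_perform()
mock <- response(
  status_code = 200,
  headers = list(`Content-Type` = "text/html; charset=utf-8"),
  body = charToRaw("<html><body><h1 class='name'>Jane Doe</h1></body></html>")
)

# resp_body_html() parses the body so that rvest selectors can be used on it
profile_name <- mock |>
  resp_body_html() |>
  html_element("h1.name") |>
  html_text2()

profile_name
```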

Example: ICA (International Communication Association) 2023 Conference

Goal

  • Let’s say we want to build a database of conference attendance
  • So for each conference website we want to get:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)

Trying to scrape the programme

  • The page looks straightforward enough!
  • There is a “Conference Schedule” with links to the individual panels
  • The table has a pretty nice class by which we can select it: class="agenda-content"
html <- read_html("https://www.icahdq.org/mpage/ICA23-Program")
Error in open.connection(x, "rb"): HTTP error 403.

Let’s Check our Network Tab

  • I noticed a request that takes quite long and retrieves a relatively large object (500 kB)
  • Clicking on it opens another window showing the response
  • Wait, is this a JSON with the entire conference schedule?

Translating the cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/") |> 
  req_url_query(
    event_id = "JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4=",
  ) |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |> 
  req_perform()

Requesting the JSON (?)

ica_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D") |> 
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    Connection = "keep-alive",
    Pragma = "no-cache",
    Referer = "https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/",
    `Sec-Fetch-Dest` = "empty",
    `Sec-Fetch-Mode` = "cors",
    `Sec-Fetch-Site` = "same-origin",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
  ) |> 
  req_perform() |> 
  resp_body_json()
object.size(ica_data) |> 
  format("MB")
[1] "9.5 Mb"

It worked!
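
Before parsing, it can be worth checking that the response really is JSON. The following sketch does this on a small mock response. The body is invented and far simpler than the real agenda.

```r
library(httr2)

# A tiny mock response that mimics a JSON answer from a hidden API
mock <- response(
  status_code = 200,
  headers = list(`Content-Type` = "application/json"),
  body = charToRaw('{"data": {"agenda": [{"day": 1}, {"day": 2}]}}')
)

# The content type tells us which resp_body_*() function to use
resp_content_type(mock)

# resp_body_json() returns nested lists, just like ica_data above
parsed <- resp_body_json(mock)
names(parsed)
length(parsed$data$agenda)
```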

Wrangling with JSON

  • This JSON file, or rather the R object it produces, is quite intimidating.
  • To get to a certain panel on the fourth day, for example, we have to enter this insane path:
ica_data[["data"]][["agenda"]][[4]][["time_ranges"]][[3]][[2]][[65]][[1]][["sessions"]][[1]] |> 
  lobstr::tree(max_length = 30)
<list>
├─id: 3113186
├─name: "Race, Ethnicity, and Religion: M..."
├─event_id: "aic1_202305"
├─start_time: "09:00"
├─end_time: "10:15"
├─calendar_stime: "2023-05-28 09:00:00"
├─calendar_etime: "2023-05-28 10:15:00"
├─place: "M - Chestnut East"
├─desc: "<br /><br /><b>Papers: </b><br /..."
├─extra: <list>
│ ├─docs: <list>
│ ├─live_stream: <list>
│ │ └─url: ""
│ ├─recorded_video: <list>
│ │ └─url: ""
│ ├─order: 791
│ ├─type: "Session"
│ ├─rate_enabled: TRUE
│ └─session_feedback_enable: TRUE
├─docs: <list>
├─session_order: 791
├─session_feedback_enable: TRUE
├─live_stream: <list>
│ └─url: ""
├─recorded_video: <list>
│ └─url: ""
├─upload_video: <NULL>
├─simulive_upload_video: <NULL>
├─speaker: <list>
... 
  • Essentially, someone pressed a relational database into a list format and we now have to scramble to cope with this monstrosity
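
Long paths like the one above can also be written with purrr::pluck(), which has the practical advantage of returning NULL instead of throwing an error when a step does not exist. A minimal sketch on a made-up toy list (not the real ICA object):

```r
library(purrr)

toy <- list(data = list(agenda = list(list(time_ranges = list("a", "b")))))

# Chained [[ and pluck() reach the same element
via_brackets <- toy[["data"]][["agenda"]][[1]][["time_ranges"]][[2]]
via_pluck <- pluck(toy, "data", "agenda", 1, "time_ranges", 2)
identical(via_brackets, via_pluck)

# A day that does not exist: [[ would stop with "subscript out of bounds",
# pluck() quietly returns NULL
missing_day <- pluck(toy, "data", "agenda", 7, "time_ranges")
is.null(missing_day)
```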

Parsing the JSON

I have not come up with a better method so far. The approach used here is a nested for loop that goes through all days and all entries in the object and looks for elements called “sessions”.

library(tidyverse, warn.conflicts = FALSE)
sessions <- list()

for (day in 1:5) {
  
  times <- ica_data[["data"]][["agenda"]][[day]][["time_ranges"]]
  
  for (l_one in seq_along(pluck(times))) {
    for (l_two in seq_along(pluck(times, l_one))) {
      for (l_three in seq_along(pluck(times, l_one, l_two))) {
        for (l_four in seq_along(pluck(times, l_one, l_two, l_three))) {
          
          session <- pluck(times, l_one, l_two, l_three, l_four, "sessions", 1)
          id <- pluck(session, "id")
          if (!is.null(id)) {
            id <- as.character(id)
            sessions[[id]] <- session
          }
          
        }
      }
    }
  }
}
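
One possible alternative to loops of a fixed depth is a recursive search that walks the list and collects whatever sits under an element called "sessions". The function collect_sessions() and the toy data are invented for this sketch, and it has not been checked against the real ICA object. Note that, unlike the loop above, it keeps every entry of a "sessions" element (not just the first), and duplicate ids would show up as duplicate names.

```r
# Recursively walk a nested list and collect all entries found under
# elements named "sessions" (hypothetical helper, base R only)
collect_sessions <- function(x) {
  found <- list()
  if (!is.list(x)) {
    return(found)
  }
  for (i in seq_along(x)) {
    nm <- names(x)[i] # NULL for unnamed lists
    child <- x[[i]]
    if (!is.null(nm) && identical(nm, "sessions")) {
      # keep every session that carries an id, named by that id
      for (s in child) {
        if (is.list(s) && !is.null(s[["id"]])) {
          found[[as.character(s[["id"]])]] <- s
        }
      }
    } else {
      # otherwise descend one level further
      found <- c(found, collect_sessions(child))
    }
  }
  found
}

# Toy data with the same kind of nesting as the agenda
toy <- list(data = list(agenda = list(
  list(time_ranges = list(
    list(list(list(sessions = list(list(id = 1L, name = "Panel A"))))),
    list(list(list(sessions = list(list(id = 2L, name = "Panel B")))))
  ))
)))

sessions_toy <- collect_sessions(toy)
names(sessions_toy)
```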

Parsing the JSON data

ica_data_df <- tibble(
  panel_id = map_int(sessions, "id"),
  panel_name = map_chr(sessions, "name"),
  time = map_chr(sessions, "calendar_stime"),
  desc = map_chr(sessions, function(s) pluck(s, "desc", .default = NA))
)
ica_data_df
# A tibble: 881 × 4
   panel_id panel_name                                               time  desc 
      <int> <chr>                                                    <chr> <chr>
 1  3113155 PRECONFERENCE: Games and the (Playful) Future of Commun… 2023… "Rec…
 2  3113156 PRECONFERENCE: Generation Z and Global Communication     2023… "Gen…
 3  3113166 PRECONFERENCE: Nothing About Us, Without Us: Authentic … 2023… "Thi…
 4  3113172 PRECONFERENCE: Reimagining the Field of Media, War and … 2023… "As …
 5  3113175 PRECONFERENCE: The Legacies of Elihu Katz                2023… "Eli…
 6  3112705 Human-Machine Preconference Breakout (room 2)            2023…  <NA>
 7  3113080 New Avoidance Preconference Breakout (room 2)            2023…  <NA>
 8  3113150 PRECONFERENCE: 12th Annual Doctoral Consortium of the C… 2023… "The…
 9  3113154 PRECONFERENCE: Ethics of Critically Interrogating and R… 2023… "The…
10  3113158 PRECONFERENCE: Human-Machine Communication: Authenticit… 2023… "The…
# ℹ 871 more rows

Extracting paper title and authors

Finally we want to parse the HTML in the description column.

ica_data_df$desc[100]
3113023
"<br /><br /><b>Participants: </b><br /><b><i>(Chairs) </i></b>Wayne Xu, U of Massachusetts Amherst<br /><br /><b>Papers: </b><br />Disentangling the Longitudinal Relationship Between Social Media Use, Political Expression and Political Participation: What Do We Really Know?<br /><i>Jörg Matthes, U of Vienna</i><br /><i>Andreas Nanz, U of Vienna</i><br /><i>Marlis Stubenvoll, U of Vienna</i><br /><i>Ruta Kaskeleviciute, U of Vienna</i><br /><br />Political Discussions on Russian YouTube: How Did They Change Since the Start of the War in Ukraine?<br /><i>Ekaterina Romanova, U of Florida</i><br /><br />Perceptions of and Reactions to Different Types of Incivility in Public Online Discussions: Results of an Online Experiment<br /><i>Marike Bormann, Unviersity of Düsseldorf</i><br /><i>Dominique Heinbach, Heinrich-Heine-U</i><br /><i>Jan Kluck, U of Duisburg-Essen</i><br /><i>Marc Ziegele, Heinrich Heine U</i><br /><br />When Trust in AI Mediates: AI News Use, Public Discussion, and Civic Participation<br /><i>Seungahn Nah, U of Florida</i><br /><i>Chun Shao, Arizona State U</i><br /><i>Ekaterina Romanova, U of Florida</i><br /><i>Gwiwon Nam, U of Florida</i><br /><i>Fanjue Liu, U of Florida</i> <a href='https://ica2023.cadmore.media/object/451094' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />" 

We can inspect HTML content by writing it to a temporary file and opening it in the browser. Below is a function that does this automatically for you:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(ica_data_df$desc[100])

Extracting paper title and authors using a function

I wrote another function for this. You can check some of the panels using the browser: check_in_browser(ica_data_df$desc[100]).

pull_papers <- function(desc) {
  # we extract the html code starting with the papers line
  papers <- str_extract(desc, "<b>Papers: </b>.+$") |> 
    str_remove("<b>Papers: </b><br />") |> 
    # we split the html by double line breaks, since it is not properly formatted as paragraphs
    strsplit("<br /><br />", fixed = TRUE) |> 
    pluck(1)
  
  
  # if there is no html code left, just return NAs
  if (all(is.na(papers))) {
    return(list(list(paper_title = NA, authors = NA)))
  } else {
    # otherwise we loop through each paper
    map(papers, function(t) {
      html <- read_html(t)
      
      # first line is the title
      title <- html |> 
        html_text2() |> 
        str_extract("^.+\n")
      
      # at least the authors are formatted in italics
      authors <- html_elements(html, "i") |> 
        html_text2()
      
      list(paper_title = title, authors = authors)
    })
  }
}
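
To see the splitting logic in isolation, here is a simplified sketch on a made-up description string. The paper titles and author names are invented, and the title is taken with a plain sub() call rather than the html_text2() route used in pull_papers().

```r
library(rvest)

# A made-up description in the same layout as the real desc column
desc <- paste0(
  "<b>Papers: </b><br />",
  "First Paper Title<br /><i>Ada Author, U of A</i><br /><i>Bob Writer, U of B</i>",
  "<br /><br />",
  "Second Paper Title<br /><i>Cy Scholar, U of C</i><br /><br />"
)

# Keep everything after the "Papers:" marker, then split into one chunk per paper
papers_html <- sub("^.*<b>Papers: </b><br />", "", desc)
chunks <- strsplit(papers_html, "<br /><br />", fixed = TRUE)[[1]]

parsed <- lapply(chunks, function(chunk) {
  html <- read_html(chunk)
  list(
    # the title is the text before the first line break
    paper_title = sub("<br />.*$", "", chunk),
    # authors are the italic elements
    authors = html |> html_elements("i") |> html_text2()
  )
})

str(parsed)
```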

Now we have all the information we wanted:

ica_data_df_tidy <- ica_data_df |> 
  slice(-613) |> 
  mutate(papers = map(desc, pull_papers)) |> 
  unnest(papers) |> 
  unnest_wider(papers) |> 
  unnest(authors) |> 
  select(-desc) |> 
  filter(!is.na(authors))
ica_data_df_tidy
# A tibble: 8,169 × 5
   panel_id panel_name                            time       paper_title authors
      <int> <chr>                                 <chr>      <chr>       <chr>  
 1  3113249 The Powers of Platforms               2023-05-2… "Serve the… Changw…
 2  3113249 The Powers of Platforms               2023-05-2… "Serve the… Ziyi W…
 3  3113249 The Powers of Platforms               2023-05-2… "Serve the… Joel G…
 4  3113249 The Powers of Platforms               2023-05-2… "Empowered… Andrea…
 5  3113249 The Powers of Platforms               2023-05-2… "Empowered… Jacob …
 6  3113249 The Powers of Platforms               2023-05-2… "The Rise … Guy Ho…
 7  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Lucia …
 8  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Mathia…
 9  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Amalia…
10  3112411 Affiliate Journals Top Papers Session 2023-05-2… "One Year … Eloria…
# ℹ 8,159 more rows
ica_data_df_tidy |> 
  filter(!duplicated(paper_title))
# A tibble: 3,277 × 5
   panel_id panel_name                                 time  paper_title authors
      <int> <chr>                                      <chr> <chr>       <chr>  
 1  3113249 The Powers of Platforms                    2023… "Serve the… Changw…
 2  3113249 The Powers of Platforms                    2023… "Empowered… Andrea…
 3  3113249 The Powers of Platforms                    2023… "The Rise … Guy Ho…
 4  3113249 The Powers of Platforms                    2023… "Google Ne… Lucia …
 5  3112411 Affiliate Journals Top Papers Session      2023… "One Year … Eloria…
 6  3112411 Affiliate Journals Top Papers Session      2023… "Digital A… Michel…
 7  3112411 Affiliate Journals Top Papers Session      2023… "Knowledge… Xiao Z…
 8  3112411 Affiliate Journals Top Papers Session      2023… "Stop Stud… Benjam…
 9  3112488 Communication in Interorganizational Coll… 2023… "Towards a… Erich …
10  3112488 Communication in Interorganizational Coll… 2023… "Nonprofit… Sophia…
# ℹ 3,267 more rows

Exercises 1

First, review the material and make sure you have a broad understanding of:

  • how to look at the requests the browser makes
  • how to copy a request as a cURL call
  • how to translate it into R code
  • why we go this route and do not simply use read_html

Then:

  1. Open the ICA site in your browser and inspect the network traffic. Can you identify the call to the programme JSON?
  2. Copy the cURL call to R and translate it to get the same data

Example: X-Twitter

Goal

  1. Get tweets from a Twitter profile
  2. Extract the text, likes, shares and comments

Can we use rvest?

xhtml <- read_html("https://x.com/EssexSumSchool")

At least the request isn’t failing…

xhtml |> 
  html_elements("[data-testid=\"cellInnerDiv\"]")
{xml_nodeset (0)}

At least one of these elements should be here!

Can we use rvest?

We can check the content that we collected from x.com using the function we defined earlier:

check_in_browser(xhtml)

Bummer, it’s only giving us the login page…

Probing the hidden/internal API

Translating a request

curl_translate("curl 'https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets?variables=%7B%22userId%22%3A%221525016244%22%2C%22count%22%3A20%2C%22includePromotedContent%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%7D&features=%7B%22rweb_tipjar_consumption_enabled%22%3Atrue%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Atrue%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22communities_web_enable_tweet_community_results_fetch%22%3Atrue%2C%22c9s_tweet_anatomy_moderator_badge_enabled%22%3Atrue%2C%22articles_preview_enabled%22%3Atrue%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22responsive_web_twitter_article_tweet_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22creator_subscriptions_quote_tweet_preview_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Atrue%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Atrue%2C%22rweb_video_timestamps_enabled%22%3Atrue%2C%22longform_notetweets_rich_text_read_enabled%22%3Atrue%2C%22longform_notetweets_inline_media_enabled%22%3Atrue%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%7D&fieldToggles=%7B%22withArticlePlainText%22%3Afalse%7D' --compressed -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br, zstd' -H 'Referer: 
https://x.com/EssexSumSchool' -H 'content-type: application/json' -H 'X-Client-UUID: e57787fa-c0a7-4dd8-afe4-eec2c676cf62' -H 'x-twitter-auth-type: OAuth2Session' -H 'x-csrf-token: 31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3' -H 'x-twitter-client-language: en-GB' -H 'x-twitter-active-user: yes' -H 'x-client-transaction-id: ifN3f4/fJUu5rqnSz6p0olGsxzjrZvREdhRXahW+9TFbQCtdX5hd8bw2bPUnxdKNTjZp0ouBJ+LhZf9sWSjkQyTZlIscig' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-origin' -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'Connection: keep-alive' -H 'Cookie: guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={\"i_p\":1722507175248,\"i_l\":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=\"v1_DwNxJ1HLE+VkeJC51vxzFA==\"' -H 'TE: trailers'")
request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |> 
  req_url_query(
    variables = '{"userId":"1525016244","count":20,"includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true,"withV2Timeline":true}',
    features = '{"rweb_tipjar_consumption_enabled":true,"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"articles_preview_enabled":true,"tweetypie_unmention_optimization_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"rweb_video_timestamps_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_enhance_cards_enabled":false}',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |> 
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    `X-Client-UUID` = "e57787fa-c0a7-4dd8-afe4-eec2c676cf62",
    `x-twitter-auth-type` = "OAuth2Session",
    `x-csrf-token` = "31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3",
    `x-twitter-client-language` = "en-GB",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "ifN3f4/fJUu5rqnSz6p0olGsxzjrZvREdhRXahW+9TFbQCtdX5hd8bw2bPUnxdKNTjZp0ouBJ+LhZf9sWSjkQyTZlIscig",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    Cookie = 'guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={"i_p":1722507175248,"i_l":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=v1_DwNxJ1HLE+VkeJC51vxzFA==',
    TE = "trailers",
  ) |> 
  req_perform()
twitter_resp <- request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |> 
  req_url_query(
    variables = '{"userId":"1525016244","count":20,"includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true,"withV2Timeline":true}',
    features = '{"rweb_tipjar_consumption_enabled":true,"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"articles_preview_enabled":true,"tweetypie_unmention_optimization_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"rweb_video_timestamps_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_enhance_cards_enabled":false}',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |> 
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    `X-Client-UUID` = "e57787fa-c0a7-4dd8-afe4-eec2c676cf62",
    `x-twitter-auth-type` = "OAuth2Session",
    `x-csrf-token` = "31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3",
    `x-twitter-client-language` = "en-GB",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "ifN3f4/fJUu5rqnSz6p0olGsxzjrZvREdhRXahW+9TFbQCtdX5hd8bw2bPUnxdKNTjZp0ouBJ+LhZf9sWSjkQyTZlIscig",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    Cookie = 'guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={"i_p":1722507175248,"i_l":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=v1_DwNxJ1HLE+VkeJC51vxzFA==',
    TE = "trailers",
  ) |> 
  req_perform()
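
The long variables string in the request is just JSON, so it can also be generated from an R list, which makes it easier to change individual values such as the user id or the count. A sketch using jsonlite is shown below. The request is only built, not performed, and a working call would still need the features parameter and the headers shown above.

```r
library(jsonlite)
library(httr2)

# The "variables" parameter from the call above, written as an R list
variables <- list(
  userId = "1525016244",
  count = 20L,
  includePromotedContent = TRUE,
  withQuickPromoteEligibilityTweetFields = TRUE,
  withVoice = TRUE,
  withV2Timeline = TRUE
)

# auto_unbox = TRUE turns length-one vectors into scalars instead of arrays
variables_json <- as.character(toJSON(variables, auto_unbox = TRUE))
variables_json

# Build (but do not perform) a request with the generated parameter
req <- request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |>
  req_url_query(variables = variables_json)
```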

Parsing the Twitter data

This is the code we developed in session 2. We can use it again to get a clean table with some interesting information

ess_tweets <- twitter_resp |> 
  resp_body_json()


entries <- pluck(ess_tweets, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries")

tweets <- map(entries, function(x) pluck(x, "content", "itemContent", "tweet_results", "result", "legacy"))

tweets_df <- map(tweets, function(t) {
  tibble(
    id = t$id_str,
    user_id = t$user_id_str,
    created_at = t$created_at,
    full_text = t$full_text,
    favorite_count = t$favorite_count,
    retweet_count = t$retweet_count,
    bookmark_count = t$bookmark_count
  )
}) |> 
  bind_rows()
tweets_df
# A tibble: 13 × 7
   id                  user_id created_at full_text favorite_count retweet_count
   <chr>               <chr>   <chr>      <chr>              <int>         <int>
 1 1818577938222141854 152501… Wed Jul 3… "Join to…              7             2
 2 1818267556605505645 152501… Tue Jul 3… "📢#ESS2…              2             0
 3 1817969272951619976 152501… Mon Jul 2… "Getting…              5             2
 4 1817840304487104716 152501… Mon Jul 2… "Mark yo…              9             4
 5 1816468781277065463 152501… Thu Jul 2… "Congrat…             23             2
 6 1816105085648158748 152501… Wed Jul 2… "Join us…              6             2
 7 1815349232217305448 152501… Mon Jul 2… "Welcome…             27             4
 8 1814324981129462000 152501… Fri Jul 1… "Our cou…              2             1
 9 1814321604815433978 152501… Fri Jul 1… "Interes…              0             0
10 1813885852000424274 152501… Thu Jul 1… "Would y…              0             0
11 1813535484532125895 152501… Wed Jul 1… "RT @Ess…              0             2
12 1812805529590218988 152501… Mon Jul 1… "Only on…              1             2
13 1812795777372082330 152501… Mon Jul 1… "RT @DrD…              0             7
# ℹ 1 more variable: bookmark_count <int>
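
Because the same parsing steps are needed for every page of results, they can be collected in one helper. The following is a minimal sketch in base R (so it runs without extra packages). The function name `parse_entries()` and the mock data are hypothetical; only the field path is taken from the parsing code above.

```r
# Hypothetical helper: turn the "entries" list of a UserTweets response
# into a data frame. The field path mirrors the pluck() calls above.
parse_entries <- function(entries) {
  rows <- lapply(entries, function(x) {
    legacy <- x[["content"]][["itemContent"]][["tweet_results"]][["result"]][["legacy"]]
    # entries that are not tweets lack this path and are skipped
    if (is.null(legacy)) return(NULL)
    data.frame(
      id = legacy$id_str,
      full_text = legacy$full_text,
      favorite_count = legacy$favorite_count,
      retweet_count = legacy$retweet_count,
      stringsAsFactors = FALSE
    )
  })
  rows <- Filter(Negate(is.null), rows)
  do.call(rbind, rows)
}

# Mock data (made up for illustration): one tweet entry, one non-tweet entry
mock_entries <- list(
  list(content = list(itemContent = list(tweet_results = list(result = list(
    legacy = list(id_str = "1", full_text = "first tweet",
                  favorite_count = 7L, retweet_count = 2L)
  ))))),
  list(content = list(value = "abc"))
)

parsed <- parse_entries(mock_entries)
parsed
```

With the real data, `parse_entries(entries)` would replace the two `map()` steps, although it returns a plain data frame rather than a tibble.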

Translating a second request

curl_translate("curl 'https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets?variables=%7B%22userId%22%3A%221525016244%22%2C%22count%22%3A20%2C%22cursor%22%3A%22DAABCgABGT4dVnF__-sKAAIZI4Qc2hsQ5AgAAwAAAAIAAA%22%2C%22includePromotedContent%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%7D&features=%7B%22rweb_tipjar_consumption_enabled%22%3Atrue%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Atrue%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22communities_web_enable_tweet_community_results_fetch%22%3Atrue%2C%22c9s_tweet_anatomy_moderator_badge_enabled%22%3Atrue%2C%22articles_preview_enabled%22%3Atrue%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22responsive_web_twitter_article_tweet_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22creator_subscriptions_quote_tweet_preview_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Atrue%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Atrue%2C%22rweb_video_timestamps_enabled%22%3Atrue%2C%22longform_notetweets_rich_text_read_enabled%22%3Atrue%2C%22longform_notetweets_inline_media_enabled%22%3Atrue%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%7D&fieldToggles=%7B%22withArticlePlainText%22%3Afalse%7D' --compressed -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 
'Accept-Encoding: gzip, deflate, br, zstd' -H 'Referer: https://x.com/EssexSumSchool' -H 'content-type: application/json' -H 'X-Client-UUID: e57787fa-c0a7-4dd8-afe4-eec2c676cf62' -H 'x-twitter-auth-type: OAuth2Session' -H 'x-csrf-token: 31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3' -H 'x-twitter-client-language: en-GB' -H 'x-twitter-active-user: yes' -H 'x-client-transaction-id: KlDU3Cx8hugaDQpxbAnXAfIPZJtIxVfn1bf0ybYdVpL444j+/Dv+Uh+Vz1aEZnEu7WnLcShbjnWhSlaEbxumlg8ly/67KQ' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-origin' -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'Connection: keep-alive' -H 'Cookie: guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={\"i_p\":1722507175248,\"i_l\":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=\"v1_DwNxJ1HLE+VkeJC51vxzFA==\"' -H 'TE: trailers'")
request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |> 
  req_url_query(
    variables = '{"userId":"1525016244","count":20,"cursor":"DAABCgABGT4dVnF__-sKAAIZI4Qc2hsQ5AgAAwAAAAIAAA","includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true,"withV2Timeline":true}',
    features = '{"rweb_tipjar_consumption_enabled":true,"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"articles_preview_enabled":true,"tweetypie_unmention_optimization_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"rweb_video_timestamps_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_enhance_cards_enabled":false}',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |> 
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    `X-Client-UUID` = "e57787fa-c0a7-4dd8-afe4-eec2c676cf62",
    `x-twitter-auth-type` = "OAuth2Session",
    `x-csrf-token` = "31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3",
    `x-twitter-client-language` = "en-GB",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "KlDU3Cx8hugaDQpxbAnXAfIPZJtIxVfn1bf0ybYdVpL444j+/Dv+Uh+Vz1aEZnEu7WnLcShbjnWhSlaEbxumlg8ly/67KQ",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    Cookie = 'guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={"i_p":1722507175248,"i_l":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=v1_DwNxJ1HLE+VkeJC51vxzFA==',
    TE = "trailers",
  ) |> 
  req_perform()
twitter_resp2 <- request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |> 
  req_url_query(
    variables = '{"userId":"1525016244","count":20,"cursor":"DAABCgABGT4dVnF__-sKAAIZI4Qc2hsQ5AgAAwAAAAIAAA","includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true,"withV2Timeline":true}',
    features = '{"rweb_tipjar_consumption_enabled":true,"responsive_web_graphql_exclude_directive_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"articles_preview_enabled":true,"tweetypie_unmention_optimization_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"rweb_video_timestamps_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_enhance_cards_enabled":false}',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |> 
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    `X-Client-UUID` = "e57787fa-c0a7-4dd8-afe4-eec2c676cf62",
    `x-twitter-auth-type` = "OAuth2Session",
    `x-csrf-token` = "31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3",
    `x-twitter-client-language` = "en-GB",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "KlDU3Cx8hugaDQpxbAnXAfIPZJtIxVfn1bf0ybYdVpL444j+/Dv+Uh+Vz1aEZnEu7WnLcShbjnWhSlaEbxumlg8ly/67KQ",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    Cookie = 'guest_id=v1%3A171990111009154001; night_mode=2; twtr_pixel_opt_in=Y; gt=1818922828541870259; g_state={"i_p":1722507175248,"i_l":1}; kdt=S7w84InVCLYXaIPfOjU7iDez89j7DzsntO8phmPp; twid=u%3D1632536605; ct0=31671b3cac0fc2462816343b781cbe3aba7576438533eccc22fe8b23f9fc99154f1347e7c3793e72473bd4ac978cc500228b97898021825a561b2e5339cce9655515b8cb0ec20648866d7a47e0642fa3; auth_token=8b79ced278aa8248351f20ac671a35dac93ca5da; att=1-CFR8q37NCOcMj1mQC4pycHeSTEFTNAw2X7EBY1Rl; lang=en-gb; d_prefs=MToxLGNvbnNlbnRfdmVyc2lvbjoyLHRleHRfdmVyc2lvbjoxMDAw; guest_id_ads=v1%3A171990111009154001; guest_id_marketing=v1%3A171990111009154001; personalization_id=v1_DwNxJ1HLE+VkeJC51vxzFA==',
    TE = "trailers",
  ) |> 
  req_perform()

Parsing the Twitter data

ess_tweets2 <- twitter_resp2 |> 
  resp_body_json()


# parse the second response (ess_tweets2), not the first one
entries2 <- pluck(ess_tweets2, "data", "user", "result", "timeline_v2", "timeline", "instructions", 2L, "entries")

tweets2 <- map(entries2, function(x) pluck(x, "content", "itemContent", "tweet_results", "result", "legacy"))

tweets_df2 <- map(tweets2, function(t) {
  tibble(
    id = t$id_str,
    user_id = t$user_id_str,
    created_at = t$created_at,
    full_text = t$full_text,
    favorite_count = t$favorite_count,
    retweet_count = t$retweet_count,
    bookmark_count = t$bookmark_count
  )
}) |> 
  bind_rows()
tweets_df2

Mission failure

I stopped at this point since there are three issues that I could not see how to resolve:

  1. How do we get the “cursor” value to keep scrolling?
  2. We have to send several identifiers (such as X-Client-UUID and x-client-transaction-id), and it is unclear where they come from
  3. It is not clear how stable x-csrf-token, authorization, and the cookies are
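
For the first issue, one idea is to look for the next cursor inside the response itself. The sketch below rests on an unverified assumption: that some entries carry a `cursorType` and a `value` field in their `content`. These field names, the function `find_cursor()`, and the mock data are hypothetical and would need to be checked against a real response in the Network tab. It does nothing about the other two issues.

```r
# Hypothetical helper: return the value of the first entry whose content
# has the requested cursorType. The field names are an ASSUMPTION.
find_cursor <- function(entries, type = "Bottom") {
  for (x in entries) {
    content <- x[["content"]]
    if (identical(content[["cursorType"]], type)) {
      return(content[["value"]])
    }
  }
  NULL  # no matching cursor found
}

# Mock entries (made up): one tweet-like entry and two cursor-like entries
mock_entries <- list(
  list(content = list(itemContent = list())),
  list(content = list(cursorType = "Top", value = "cursor-top")),
  list(content = list(cursorType = "Bottom", value = "cursor-bottom"))
)

next_cursor <- find_cursor(mock_entries)
next_cursor
```

If such a value exists, it could be placed into the `cursor` field of the `variables` query parameter for the next request.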

Summary: hidden APIs

What are they?

  • used by services of a company to communicate with each other
  • code on a website often uses one to download additional content
  • the browser logs them and provides them to us as cURL calls

What are they good for?

  • We can often use them to get content that is otherwise unavailable
  • We can study them to find out what requests the website server accepts
  • Some websites allow access just using a special header or cookies
  • If they are somewhat flexible we can wrap them in a function or package
  • This can allow us to gather data on scale
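
As an illustration of wrapping a hidden API in a function, here is a minimal sketch built around the UserTweets endpoint from this session. The function name `build_user_tweets_req()` is hypothetical. It only builds the `variables` query parameter; the `features` parameter and all headers and cookies from the translated cURL call are left out, so the request as built here would probably not be accepted by the server.

```r
library(httr2)
library(jsonlite)

# Hypothetical wrapper: build (but do not perform) a UserTweets request
build_user_tweets_req <- function(user_id, cursor = NULL, count = 20L) {
  vars <- list(
    userId = user_id,
    count = count,
    includePromotedContent = TRUE,
    withQuickPromoteEligibilityTweetFields = TRUE,
    withVoice = TRUE,
    withV2Timeline = TRUE
  )
  vars$cursor <- cursor  # does nothing when cursor is NULL
  request("https://x.com/i/api/graphql/g4sgqIykZaGDN0_w_ZraYw/UserTweets") |>
    req_url_query(variables = as.character(toJSON(vars, auto_unbox = TRUE)))
}

req <- build_user_tweets_req(
  "1525016244",
  cursor = "DAABCgABGT4dVnF__-sKAAIZI4Qc2hsQ5AgAAwAAAAIAAA"
)
req$url
```

Keeping the user ID and cursor as arguments means the same function could serve every page of every account, provided the remaining headers stay valid.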

Issues

  • Companies have mechanisms to counter scraping:
    • signing specific requests (TikTok)
    • obscuring pagination (Twitter)
    • rate limiting requests per second/minute/day and user/IP (Twitter)
    • expiring session tokens (telegraaf.nl)
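
Of these mechanisms, rate limiting is the one we can most easily accommodate on our side. A minimal sketch with httr2, assuming a placeholder URL and a limit of 30 requests per minute (both made up for illustration, not taken from any real service):

```r
library(httr2)

# Build a request that slows itself down and retries a few times.
# Nothing is sent here; the request is only configured.
polite_req <- request("https://example.com/api") |>  # placeholder URL
  req_throttle(rate = 30 / 60) |>  # assumed limit: 30 requests per minute
  req_retry(max_tries = 3)         # retry transient failures up to 3 times

polite_req
```

The throttle and retry settings take effect once the request is passed to `req_perform()`. The other counter-measures in the list (signed requests, obscured pagination, expiring tokens) are not addressed by this.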

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_1.0.4        httr2_1.0.1        lubridate_1.9.3    forcats_1.0.0     
 [5] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5       
 [9] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
[13] tinytable_0.3.0.10

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    xml2_1.3.6       
 [5] stringi_1.8.4     hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
 [9] evaluate_0.23     grid_4.4.1        timechange_0.3.0  fastmap_1.1.1    
[13] lobstr_1.1.2      jsonlite_1.8.8    processx_3.8.4    chromote_0.2.0   
[17] ps_1.7.7          promises_1.3.0    httr_1.4.7        selectr_0.4-2    
[21] fansi_1.0.6       scales_1.3.0      cli_3.6.3         crayon_1.5.2     
[25] rlang_1.1.4       docopt_0.7.1      munsell_0.5.1     withr_3.0.0      
[29] yaml_2.3.8        tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0 
[33] curl_5.2.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[37] pkgconfig_2.0.3   pillar_1.9.0      later_1.3.2       gtable_0.3.5     
[41] glue_1.7.0        Rcpp_1.0.12       xfun_0.44         tidyselect_1.2.1 
[45] rstudioapi_0.16.0 knitr_1.46        websocket_1.4.1   htmltools_0.5.8.1
[49] rmarkdown_2.26    compiler_4.4.1
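
To keep this record next to the collected data rather than only in the rendered slides, the output can also be written to a file. A small sketch, where the file name is a hypothetical choice and a temporary directory is used to keep the example free of side effects:

```r
# Write the session information to a text file (hypothetical file name)
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Check the first few lines of what was saved
head(readLines(session_file), 3)
```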